stochastic models such as hidden Markov models (Eddy 2004) allow hidden system states
(e.g. exon, intron) to be predicted from a sequence of observations (e.g. ATCCCTG ...)
using a Markov chain (Bayesian network; supervised machine learning). Hidden Markov
models are widely used for genome annotation (exon-intron regions; e.g. the GENSCAN
program), but also for protein domain prediction (e.g. the Pfam, SMART and InterPro
databases and the HMMER program) and network regulation (e.g. signal peptides; SignalP,
TMHMM programs). In addition, there are numerous specialized programs and databases that
detect RNA sequences (e.g. Rfam, tRNAscan), viral sequences, repeat regions (e.g.
RepeatMasker) and other sites in the genome (e.g. enhancers, miRNAs, lncRNAs) and label
them accordingly.
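How a hidden Markov model assigns hidden states to a sequence can be illustrated with the Viterbi algorithm, which finds the most probable state path for the observed nucleotides. The following is a minimal sketch only: the two states, all probabilities and the function name `viterbi` are assumptions made up for illustration (a GC-rich "exon" state and an AT-rich "intron" state) and are not taken from GENSCAN or any trained model.

```python
import math

STATES = ("exon", "intron")

# Illustrative, made-up parameters (assumptions, not from a trained model).
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich state
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich state
}


def viterbi(sequence):
    """Return the most probable hidden state path for a DNA sequence (A, C, G, T)."""
    if not sequence:
        return []
    # Work with log-probabilities to avoid numerical underflow on long sequences.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("GCGCGCGCGCATATATATAT"))
```

With these toy parameters, a GC-rich stretch followed by an AT-rich stretch is split into an "exon" segment and an "intron" segment. Real gene finders use many more states (e.g. splice sites, start and stop codons) and parameters estimated from annotated training sequences.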
An important step is also to take a closer look at the promoter. Transcription factors
bind to DNA sequence motifs (D’haeseleer 2006) in the promoter (so-called transcription
factor binding sites, TFBS) and thus regulate gene expression (transcription).
These conserved DNA patterns, usually consisting of 8–20 nucleotides, can be recognized
computationally by binding site pattern recognition algorithms based on experimental
data, such as chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). A
distinction is made between probabilistic (binding sites; position weight matrix),
discriminant (binding sites + non-functional sites) and energy (binding sites + binding
free energy) TFBS models (Stormo 2010, 2013). Databases such as TRANSFAC and JASPAR
contain the TFBS matrices for different organisms. These can be used, for example, to
search a sequence for TFBS in order to understand gene expression (e.g. with programs
such as MotifMap, ALGGEN PROMO or TESS), but also to find possible regulation via
modular TFBS (TF modules) (e.g. using the Genomatix program). In addition, ab initio
approaches (e.g. MEME Suite and iRegulon) try to find recurrent sequence patterns in
multiple sequences via multiple alignment; these patterns are then compared to known
TFBS motifs for similarity. For example, we showed that the heart failure-associated
lncRNA Chast is regulated by promoter binding of Nfat4 (Viereck et al. 2016).
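The probabilistic TFBS model mentioned above, the position weight matrix, can be sketched in a few lines: aligned binding sites are converted into log-odds scores per position, and a sequence is then scanned window by window. This is a hedged illustration only: the four example sites, the uniform background of 0.25, the pseudocount, the threshold and the function names `build_pwm` and `scan` are assumptions for demonstration and do not come from TRANSFAC, JASPAR or any of the programs named in the text.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Log-odds score: observed base frequency relative to the background.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned binding sites (made up for this example).
    example_sites = ["TGACGTCA", "TGACGTAA", "TGACGCAA", "TTACGTCA"]
    pwm = build_pwm(example_sites)
    for hit in scan("AAGCTGACGTCATTGC", pwm, threshold=5.0):
        print(hit)
```

In practice the matrix would be taken from a curated database rather than built from four sites, both DNA strands would be scanned, and the threshold would be chosen from the score distribution on background sequence, since short motifs produce many chance matches in a genome.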
In this way, from 1995 onwards (with E. coli and yeast), the first genomes began to be
completely annotated and published. This was followed by the genomes of eukaryotes
(cells with a cell nucleus), which were about a thousand times larger, in particular
that of humans (2001) and many other higher organisms (fly, mosquito, mouse, rat,
chimpanzee, chicken, fish, etc.).
Another aspect is then to assemble the encoded proteins, RNAs and elements into
higher-level networks. For example, a single enzyme does not stand alone, but is part of
a metabolic network (see next chapter). In the same way, a transcription factor that
binds to the promoter of a gene does not stand alone, but is part of the overall
regulation (so-called regulatory networks, see the chapter after next). The precise
description of individual genes often requires not only DNA but also RNA data
(“transcriptome”), in particular in order to precisely determine the start and end of
the segments transcribed into RNA. An integrative analysis yields the most accurate
results here, even in the case of viruses with their compact genomes (Whisnant et al.
2020).
3.1 Sequencing Genomes: Spelling Genomes